ON THE SIMULTANEOUS CLASSIFICATION OF SEVERAL INDIVIDUALS


1.  Introduction. 

It  has  been  usual  in  the  study  of  classification 
problems  to  consider  the  classification  of  one  item  at  a 
time.  However,  in  practice  one  frequently  deals  with  a  whole 
group  of  items  each  of  which  has  to  be  assigned  to  its  proper 
category.  There  seem  to  be  two  main  reasons  why  it  is  worth 
considering  the  problem  in  this  more  general  form. 

First, one may gain in efficiency. This happens, roughly
speaking, because one can utilize the totality of observations
to obtain estimates of unknown parameters. For certain problems
this has always been realized, and procedures obtained in this
manner have been considered in the literature. For certain other
types it was pointed out first by Robbins [1], and another
example was given by Levene [2].

The other important reason for considering several items
simultaneously is that one is thus led to new formulations. In
particular, problems arise where the definition of the various
categories is given not absolutely but in terms of the other
items in the group. A suggestive although not too typical
example is the assigning of grades in an examination or course.

Throughout  the  present  discussion  we  shall  assume 
that  the  items  to  be  classified  are  all  of  one  kind,  and  that 
all  of  them  are  to  be  distributed  among  the  same  categories. 

Under these circumstances it is frequently reasonable to
measure the loss resulting from wrong decisions by the number
of incorrect classifications. (By "risk" we shall then mean
the expected number of wrong classifications.) This is of
course nearly always a considerable oversimplification.
However, it is necessary to work with standardized loss
functions, and the one suggested does have a concrete and
intuitively appealing significance. It is comparable with the
"simple" loss functions utilized by Wald and with the
formulation of hypothesis testing in terms of the probabilities
of errors given by Neyman and Pearson.

For our purpose it is important to distinguish classification
problems according to the nature of the items to be classified.
These may be on the one hand students, prospective doctors,
skulls, plants, etc. On the other hand they may be varieties of
wheat, different production processes or treatments of a
disease. The division of course is not completely clear-cut.
However, in the cases listed second one classifies on the basis
of independent, identically distributed variables; that is, one
is dealing with (statistical) populations and makes the
classification on the basis of a random sample from the
population.
This is usually not the case in the group of problems listed
first. There the basis of classification is a set of
measurements, one each of a number of different characteristics.
The usual assumption attributes to this set of measurements a
multivariate normal distribution. This assumption implies that
the item being measured has itself been obtained by means of a
chance mechanism from some population of such items; the
replication of the experiment consists in drawing another batch
of items from this population. Suppose now that it is desired to
classify the items each into one of a number of categories. Then
each item of the total population falls into one of these
categories; the numbers in the categories are in certain
proportions. It follows that for each of the items to be
classified there is a definite probability of falling into each
of the categories. Thus, while the assumption of a priori
probabilities for the various categories seems inevitable when
one is classifying individuals, this assumption is usually
inappropriate for the classification of populations.

The description of the items in the one case as individuals,
in the other as populations is of course a somewhat loose one.
Thus, for example, each student in a class plays the role of a
population in the following problem. The knowledge of the
student is to be tested by a true-false examination. One
possible measure of his knowledge is the proportion, among the
totality of true-false questions that could be asked, which he
can answer. The questions that are asked in the examination can
be thought of (in a very rough approximation) as a random sample
from the totality of possible questions. Another example of this
kind is provided when we attempt to determine the genotype of an
individual through the number of recessives among his offspring.

However, generally speaking, there is a fairly clear
distinction between the two types of problems. When we classify
populations we are dealing with samples from distributions with
unknown but fixed parameters. In the case of individuals, we
assume multivariate distributions involving parameters which
themselves constitute random variables.

There is a further difference between the two types of
problems, which is not of a theoretical nature but which
nevertheless is of some importance. Roughly speaking, and
admitting that there are important exceptions, we can say that
the simultaneous classification of populations usually involves
only a small number of populations, say 2 to 10, while the
number of individuals in a group to be classified frequently is
considerably higher. Thus it is of interest to develop an
asymptotic theory for the classification of a large number of
individuals. On the other hand, in the classification of
populations it is usually not too reasonable to assume a large
number of them. An asymptotic theory here would more likely be
concerned with large samples from each of the populations.

Classification problems differ not only through the nature of
the objects that are to be classified but also in various
respects according to the categories among which the items are
to be distributed. As a first distinction, we may either be
dealing with k clearly distinct categories or with
classification according to a parameter with a continuous range
of variation which is broken up (sometimes somewhat arbitrarily)
to provide the categories. We shall here concern ourselves
mainly with the first of these two types of problems. Although
this has a somewhat restricted range of applications it does
arise naturally in problems of taxonomy. In particular one would
expect it to be of increasing importance for the determination
of genotypes, which is of interest in genetical and
anthropological research. (See for example [3].)

For the problem of k categories corresponding to k distinct
values of a characteristic parameter, there is a further
important subdivision according to the amount of information
that is available concerning these categories. In the simplest
case the distribution of the observable random variables is
known in each of the categories, except possibly for nuisance
parameters. As a second possibility one may assume instead of
known distributions that measurements of a number of individuals
of known category (for example of known genotype with respect to
some simple gene) are available. Thus in the case k = 2 we may
have a number of Y's and Z's and want to classify an X as
belonging to the same category as either the Y's or the Z's. For
this problem it has been customary in the literature to consider
the classification only of a single X. However, it seems clear
that if several X's are to be classified the procedure can be
much improved.
This becomes particularly clear if we go one step further and
assume the distributions to be unknown but no Y's and Z's
available for guidance. Suppose, for example, that
$X_1, \dots, X_n$ are each known to come from one of two normal
distributions with $E(X_i) = \theta_i$ equal to either a or b and
variance $\sigma^2 = 1$, but that a and b are unknown. Suppose
further that the $\theta$'s are independent random variables
with $P(\theta_i = a) = p$. It is then clear that for large n one
will be able to obtain good estimates of a, b, p and hence carry
out a reasonable classification of the X's. The problem is
closely related to that of testing for outlying observations,
which however is usually treated under somewhat different
assumptions (see for example Grubbs [4] and Dixon [5]).
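
As a rough illustration of the estimation step just described, the
following Python sketch fits a, b and p to a sample from a mixture
of two unit-variance normal distributions and then assigns each
observation to the more probable mean. The text does not say how
the estimates are to be obtained; the EM algorithm is swapped in
here as one standard possibility, and the function names, starting
values and simulated data are all assumptions of the example.

```python
import math
import random


def normal_pdf(x, mean):
    """Density at x of a normal distribution with the given mean and variance 1."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def fit_two_means(xs, iterations=100):
    """EM estimates (a, b, p) for the mixture p*N(a, 1) + (1 - p)*N(b, 1).

    The labels a and b are arbitrary; a is started at the smallest observation.
    """
    a, b, p = min(xs), max(xs), 0.5  # crude, well-separated starting values (assumed)
    n = len(xs)
    for _ in range(iterations):
        # E-step: posterior probability that each observation has mean a.
        weights = []
        for x in xs:
            fa = p * normal_pdf(x, a)
            fb = (1.0 - p) * normal_pdf(x, b)
            weights.append(fa / (fa + fb))
        # M-step: weighted means and mixing proportion.
        total = sum(weights)
        a = sum(w * x for w, x in zip(weights, xs)) / total
        b = sum((1.0 - w) * x for w, x in zip(weights, xs)) / (n - total)
        p = total / n
    return a, b, p


def classify(xs, a, b, p):
    """Label each x 'a' or 'b', whichever mean has the larger posterior probability."""
    return ["a" if p * normal_pdf(x, a) > (1.0 - p) * normal_pdf(x, b) else "b"
            for x in xs]


# Simulated data (hypothetical values): a = 0, b = 4, p = 0.3.
rng = random.Random(0)
truth = ["a" if rng.random() < 0.3 else "b" for _ in range(2000)]
xs = [rng.gauss(0.0 if t == "a" else 4.0, 1.0) for t in truth]

a_hat, b_hat, p_hat = fit_two_means(xs)
labels = classify(xs, a_hat, b_hat, p_hat)
error_rate = sum(l != t for l, t in zip(labels, truth)) / len(xs)
```

With well-separated means the estimates settle quickly; when a and
b are close, the same sketch would need many more observations
before the classification becomes reasonable.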

In the present paper we shall assume that all items with whose
classification we are concerned are to be classified
simultaneously. This is of course not always the case.
Frequently the classification has to be carried out serially. It
seems likely that in many cases the optimum serial
classification procedure consists in classifying $\pi_n$ on the
basis of observations on $\pi_1, \dots, \pi_n$ as if the problem
were the simultaneous classification of $\pi_1, \dots, \pi_n$.
Hence the work done here should be applicable at least in part
also to this problem.

In the present paper we shall consider the classification of
individuals. For some simple problems the minimax procedures are
obtained. Since they become asymptotically inadmissible as the
number of individuals gets large, other procedures are given
that in the limit are minimax and admissible.

In a second paper we hope to treat analogous problems for the
classification of populations. However here, as has already been
pointed out, we shall keep the number of populations fixed and
consider the case of large samples from each of the populations.
This problem seems to be very much harder than the one treated
here, and it is not too clear what general results to expect.

The present paper and the projected paper on the classification
of populations are both related to a third paper on the theory
of selection. In the first two papers it is assumed throughout
that the categories are defined in absolute terms. The third
paper constitutes an attempt at a problem in which this is not
the case. It is concerned with classifying each of s populations
as good or bad, where a population is defined as good if its
quality is within given limits of that of the best of the
populations. Although the theory of the minimax procedures is,
as usual, easy (it involves an extension of the fundamental
lemma of Neyman and Pearson to the case of vector-valued
critical functions), the application to particular cases
presents difficulties which the author has not been able to
overcome.

I should like to acknowledge my indebtedness to Dr. Howard
Levene. In a seminar talk he presented his binomial example of
the phenomenon discovered by Robbins and contrasted this with
the minimax procedure. While much of the present paper was
already written at the time, Dr. Levene's remarks suggested
certain extensions of the work in progress.

A complete copy of Robbins' paper [1] was not available to me
until after the present paper had been completed. In [1] Robbins
indicates a method of approach which would seem to yield
results, analogous to the ones obtained here, for a much more
general class of problems.

2.  Effect of a priori probabilities.

One of the main characteristics of problems concerning the
classification of individuals is the assumption of a priori
probabilities for the various categories. At first thought it
might appear that the problem of the simultaneous classification
of several individuals, as compared with the classification of
one individual at a time, is much complicated by the fact that
the a priori probabilities introduce dependence. Whether or not
this is so depends, as has been pointed out by Mood [7] in a
somewhat different context, on the procedure by which the
individuals being classified have been obtained from the
population. If the method is that of random sampling there is no
dependence. This was shown by Mood for variables taking on only
the values 1 and 0 and clearly holds in general. For let
$X_1, \dots, X_n$ be independently and identically distributed,
and let $(i_1, \dots, i_m)$ be m integers chosen at random from
the set $(1, \dots, n)$. Then the set of variables
$(X_{i_1}, \dots, X_{i_m})$ is clearly independent of the set of
remaining X's.

If on the other hand the group being classified has been
obtained by some other method, one will in general expect
dependence. It follows that for the problems under consideration
it is important to know how the group that is being tested was
obtained.

When we are dealing with a random sample from a population and
the proportions of the various categories in the population are
known, it is clear from the above remark that the best method of
simultaneous classification of the individuals in the sample
consists in simply classifying each individual separately as
well as possible without any regard to the remainder of the
sample.

If, however, the proportions in the various categories are not
known it has always been recognized that one should estimate
these proportions from the sample. (See for example [6].) This
is closely connected with the very interesting results obtained
by Robbins [1]. He also pointed out that in the problem
considered by him the minimax solution makes no use of the
information that the sample contains concerning these
proportions and that consequently the minimax solution is very
inefficient for large samples. Another interesting example of
the same nature was studied by Levene. The results of Robbins
and Levene are considerably more startling than the ones we
shall find here, since in their examples there is no assumption
of any a priori probabilities. On the other hand their results
are more difficult to interpret since there the parameter space
changes with the sample so that there is no clear-cut fixed
frame of reference.

3.  An example.

Suppose we are interested in classifying n plants according to
their genetic composition with respect to a single gene (a, A).
We assume that the joint distribution of certain measurements is
known for each of the possible genotypes, and is the same for
dominants and hybrids. The plants come from the cross of a
hybrid with a plant that was either recessive or hybrid. Hence
it is known that the plants under consideration constitute a
sample from a binomial distribution where the probability p of
any one plant being recessive is either 1/4 or 1/2.

As throughout the paper we assume that the loss resulting from
wrong classifications is measured by the number of these
incorrect decisions, so that we want to minimize the expected
number of misclassifications. If we adopt the minimax point of
view it is easily seen that we shall act as if p were known to
have that one of the values 1/4, 1/2, say $p_0$, which has the
greater Bayes risk. (The Bayes risk corresponding to a value
$p_0$ of p is the minimum expected number of misclassifications
that can be achieved when p is known to be equal to $p_0$.) Each
plant is then classified without regard to the measurements on
the other plants in such a way as to maximize the probability of
its correct classification.

Clearly, if n is at all large this procedure is very
unreasonable. For we can then determine with near certainty
whether p = 1/4 or p = 1/2. In one case we shall proceed as
before, while in the other we shall modify our procedure. As a
result of this modification there will of course be a slight
increase in risk when $p = p_0$. This stems from the fact that
there is a small but positive probability of having decided on
the wrong value of p. However, this increase is balanced by a
very substantial decrease in risk when p has the other value.

Let us now consider the problem quantitatively. We are
concerned with random variables (presumably vector-valued)
$X_1, \dots, X_n$. The variable $X_i$ has probability density
$p_{\theta_i}(x)$. The $\theta$'s are independent random
variables, each capable of taking on the value 1 or 0. The
probability $p = P(\theta_i = 1)$ is independent of i, and it is
known that either p = 1/2 or p = 1/4. The problem is to classify
each $X_i$ into category $\mathcal{C}_1$ or $\mathcal{C}_0$ as
$\theta_i$ is 1 or 0.

If p were known, we would classify $X_i$ into $\mathcal{C}_1$ if

$$\frac{p_1(X_i)}{p_0(X_i)} > \frac{q}{p}$$

and into $\mathcal{C}_0$ if the opposite inequality holds, where
q = 1 - p. The expected number of misclassifications in this
case is $\alpha_p n$, where

$$\alpha_p = q\,P_0\!\left(\frac{p_1(X)}{p_0(X)} > \frac{q}{p}\right) + p\,P_1\!\left(\frac{p_1(X)}{p_0(X)} < \frac{q}{p}\right).$$

The minimax procedure clearly is the one appropriate to that
value $p_0$ of p (1/4 or 1/2) for which $\alpha_p$ is larger. To
be specific, let us assume that $p_0 = 1/2$.
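
To make the choice of $p_0$ concrete, the short Python sketch
below evaluates the per-item Bayes risk for p = 1/4 and p = 1/2
and picks the value with the larger risk. The text leaves the two
densities unspecified; the sketch assumes, purely for
illustration, that the category-0 density is N(0, 1) and the
category-1 density is N(1, 1), and the function names are
hypothetical.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_error(p, shift=1.0):
    """Per-item Bayes risk when p_0 = N(0, 1) and p_1 = N(shift, 1) (assumed).

    For these densities p_1(x)/p_0(x) = exp(shift*x - shift**2/2), so the rule
    p_1(x)/p_0(x) > q/p is the same as x > c with c = shift/2 + log(q/p)/shift.
    """
    q = 1.0 - p
    c = shift / 2.0 + math.log(q / p) / shift
    false_one = 1.0 - phi(c)       # P_0(item put into category 1)
    false_zero = phi(c - shift)    # P_1(item put into category 0)
    return q * false_one + p * false_zero


candidates = [0.25, 0.5]
alphas = {p: bayes_error(p) for p in candidates}
p0 = max(candidates, key=lambda p: alphas[p])   # value of p with the greater Bayes risk
```

Under these assumed densities the larger risk belongs to p = 1/2,
which agrees with the case singled out in the text; other
densities could of course give the opposite ordering.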

Let Y be the number of variables $X_i$ that are classified into
$\mathcal{C}_1$ by the minimax procedure. Then Y is the number
of successes in n independent trials with constant probability

$$p(1-\alpha) + (1-p)\alpha = \alpha + p(1-2\alpha)$$

of success. Hence $\dfrac{Y/n - \alpha}{1 - 2\alpha}$ is a
consistent estimate of p.

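Let us replace the minimax procedure by the following:

    if $\dfrac{Y/n - \alpha}{1 - 2\alpha} > \dfrac{3}{8}$, use the minimax procedure;

    if $\dfrac{Y/n - \alpha}{1 - 2\alpha} < \dfrac{3}{8}$, use the procedure appropriate for p = 1/4.

To compute the risk of this new procedure suppose first that
p = 1/2. Then the expected number of misclassifications is

$$
\begin{aligned}
P\!\left(\frac{Y/n-\alpha}{1-2\alpha} > \frac{3}{8}\right) n
\Bigg[\, & q\,P_0\!\left(\frac{p_1(X)}{p_0(X)} > k \,\Bigg|\, \frac{Y/n-\alpha}{1-2\alpha} > \frac{3}{8}\right) \\
& + p\,P_1\!\left(\frac{p_1(X)}{p_0(X)} < k \,\Bigg|\, \frac{Y/n-\alpha}{1-2\alpha} > \frac{3}{8}\right) \Bigg] \\
& + P\!\left(\frac{Y/n-\alpha}{1-2\alpha} < \frac{3}{8}\right) n\,\beta_n ,
\end{aligned}
\tag{1}
$$

where k is the constant of the minimax procedure and
$\beta_n \le 1$.

It is easily seen that

$$n\,P\!\left(\frac{Y/n-\alpha}{1-2\alpha} < \frac{3}{8}\right) \to 0 \quad \text{as } n \to \infty.$$

Hence the second term of (1) tends to zero. In the first term,
the first factor tends to 1, while the last factor tends to the
sum of the unconditional probabilities. Thus the ratio of the
risk to the minimax risk tends to 1 as $n \to \infty$.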

An exactly analogous argument shows that when p = 1/4 the ratio
of the new risk to the risk of the Bayes procedure corresponding
to p = 1/4 also tends to 1.
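
The gain can be seen in a small simulation. The Python sketch
below applies the two-stage rule of this section (estimate p from
Y/n, compare with 3/8, then use the rule for p = 1/2 or for
p = 1/4) to data generated with p = 1/4, and compares its error
rate with that of the minimax rule. The densities N(0, 1) and
N(1, 1), the sample size, the seed and all function names are
assumptions made for the illustration, not part of the text.

```python
import math
import random


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Assumed densities: p_0 = N(0, 1), p_1 = N(1, 1), so p_1(x)/p_0(x) = exp(x - 1/2).
CUT_HALF = 0.5                      # ratio > 1 (rule for p = 1/2) means x > 1/2
CUT_QUARTER = 0.5 + math.log(3.0)   # ratio > 3 (rule for p = 1/4)
ALPHA = 1.0 - phi(0.5)              # error probability of the p = 1/2 rule in either category


def two_stage(xs):
    """Use the p = 1/2 rule unless the estimate of p falls below 3/8."""
    y = sum(x > CUT_HALF for x in xs)                     # Y of the text
    p_hat = (y / len(xs) - ALPHA) / (1.0 - 2.0 * ALPHA)   # consistent estimate of p
    cut = CUT_HALF if p_hat > 3.0 / 8.0 else CUT_QUARTER
    return [1 if x > cut else 0 for x in xs]


def error_rate(labels, thetas):
    """Fraction of items whose assigned label differs from the true theta."""
    return sum(l != t for l, t in zip(labels, thetas)) / len(thetas)


rng = random.Random(1)
n = 4000
thetas = [1 if rng.random() < 0.25 else 0 for _ in range(n)]   # true p = 1/4
xs = [rng.gauss(float(t), 1.0) for t in thetas]

minimax_error = error_rate([1 if x > CUT_HALF else 0 for x in xs], thetas)
two_stage_error = error_rate(two_stage(xs), thetas)
```

For these assumed densities the minimax rule has error probability
of about 0.31 whatever p is, while the Bayes rule for p = 1/4 has
about 0.22, so the two-stage rule should show a clearly smaller
error rate on such a sample.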

We have used here for simplicity the frequency Y/n to decide
between p = 1/2 and p = 1/4. However, more sensitive methods are
available, and one should expect these to yield better results
also for the classification problem. Thus one might decide for
p = 1/2 if

$$\sum_{i=1}^{n} \log \frac{p_0(X_i) + p_1(X_i)}{3\,p_0(X_i) + p_1(X_i)} > k,$$

where k is some suitable constant.
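
A sketch of such a likelihood-based decision follows. Each
observation has the mixture density $q\,p_0(x) + p\,p_1(x)$, so
the log-likelihood ratio of p = 1/2 against p = 1/4 equals
n log 2 plus the sum of $\log[(p_0 + p_1)/(3p_0 + p_1)]$ over the
sample. The Python code assumes normal densities N(0, 1) and
N(1, 1) and takes k = -n log 2 (a plain likelihood comparison);
both choices, like the function names, are assumptions of the
example and not prescriptions of the text.

```python
import math
import random


def log_ratio_statistic(xs):
    """Sum of log[(p_0 + p_1) / (3 p_0 + p_1)] over the sample.

    With the assumed densities p_0 = N(0, 1), p_1 = N(1, 1) the ratio
    r = p_1(x)/p_0(x) is exp(x - 1/2), and each term equals log[(1 + r)/(3 + r)].
    """
    total = 0.0
    for x in xs:
        r = math.exp(x - 0.5)
        total += math.log((1.0 + r) / (3.0 + r))
    return total


def decide_p(xs):
    """Return 0.5 or 0.25, using the assumed cutoff k = -n log 2."""
    k = -len(xs) * math.log(2.0)
    return 0.5 if log_ratio_statistic(xs) > k else 0.25


def sample(p, n, rng):
    """n observations whose theta is 1 with probability p."""
    return [rng.gauss(1.0 if rng.random() < p else 0.0, 1.0) for _ in range(n)]


rng = random.Random(2)
decision_half = decide_p(sample(0.5, 5000, rng))
decision_quarter = decide_p(sample(0.25, 5000, rng))
```

Other values of k shift the balance between the two kinds of wrong
decision about p; the choice made here treats them symmetrically.
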

It should also be pointed out that although the procedure
discussed here has good asymptotic properties, it is not
admissible. In fact it is easy to obtain the totality of
admissible procedures since by Wald's theory [8] this coincides
in our case with the totality of Bayes solutions. But there are
only two parameter points: p = 1/2 and p = 1/4. Hence a Bayes
formulation assumes probabilities $\rho = P(p = 1/2)$,
$1 - \rho = P(p = 1/4)$, and the class of all Bayes solutions is
a one-parameter family, one solution for each value of $\rho$.

These Bayes solutions are easy to obtain as follows. Any
classification procedure of n items into two categories is a
vector-valued function
$\phi(x) = (\phi_1(x), \dots, \phi_n(x))$,
$0 \le \phi_i(x) \le 1$. If x is observed, the i-th item is
classified into $\mathcal{C}_1$ with probability $\phi_i(x)$,
into $\mathcal{C}_0$ with probability $1 - \phi_i(x)$. Instead of
minimizing the Bayes risk of a procedure $\phi$ corresponding to
some given value $\rho$, we shall maximize the expected number of
correct decisions, which is given by
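
$$\sum_{k=0}^{n}\ \sum_{(i_1,\dots,i_k)} c_k \int \Big[\phi_{i_1}(x) + \cdots + \phi_{i_k}(x) + \big(1-\phi_{j_1}(x)\big) + \cdots + \big(1-\phi_{j_{n-k}}(x)\big)\Big]\, p_1(x_{i_1}) \cdots p_1(x_{i_k})\, p_0(x_{j_1}) \cdots p_0(x_{j_{n-k}})\, dx,$$

where $c_k = \rho\,(1/2)^n + (1-\rho)\,(1/4)^k (3/4)^{n-k}$ is the
a priori probability of any one configuration in which exactly k
of the $\theta$'s are equal to 1 ($\rho$ being the a priori
probability of p = 1/2), with the summation $(i_1, \dots, i_k)$
extending over all combinations of k integers out of
$(1, \dots, n)$ and where $j_1, \dots, j_{n-k}$ denotes the
remaining integers.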

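Thus we have to maximize an expression of the form

$$\sum_{j=1}^{n} \sum_{i} a_{ij}\, p_i(x)\, \phi_j(x),$$

which is achieved by setting

$$\phi_j(x) = 1 \quad \text{whenever} \quad \sum_i a_{ij}\, p_i(x) > 0,$$

$$\phi_j(x) = 0 \quad \text{whenever} \quad \sum_i a_{ij}\, p_i(x) < 0.$$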

Unfortunately, although it is easy to write this down
explicitly, the resulting procedure does not seem very
manageable.


4. 


The inefficiency for large samples of the minimax solution
that we found here is not an isolated phenomenon. There are many
cases of a sequence of problems $\pi_n$ for each of which a
minimax solution exists that is unique and hence admissible.
However, for large n there exist other procedures which, at the
expense of a slight increase of the risk function for some
parameter values, reduce the risk very substantially for other
values of the parameter.

These considerations lead to the following definition. Let the
distribution of the observable random variables depend on a
parameter $\theta$, and denote the risk function of a decision
procedure $\delta$ by $R_\delta(\theta)$. Then we shall say that
a sequence of decision procedures $\delta_n$ is asymptotically
non-admissible if there exists a sequence of procedures
$\delta_n^*$ such that

$$\limsup_{n \to \infty} \frac{R_{\delta_n^*}(\theta)}{R_{\delta_n}(\theta)} \le 1 \quad \text{for all } \theta \tag{i}$$

with strict inequality holding for some $\theta$. (The results
of section 3 show that in the example considered there the
minimax procedure is asymptotically non-admissible.) In analogy
with the above definition one can define the notion of
asymptotic admissibility and an asymptotic minimax procedure;
this latter notion was introduced by Wald [9]. It also seems
useful to define the following concept:

A sequence of procedures $\delta_n$ is said to be consistent if
for each $\theta$

$$\lim_{n \to \infty} \Big[ R_{\delta_n}(\theta) - \inf_{\delta} R_{\delta}(\theta) \Big] = 0.$$

We can then state the following obvious result. If for each
$\theta$

$$\lim_{n \to \infty} \Big( \inf_{\delta_n} R_{\delta_n}(\theta) \Big) > 0$$

and if $\delta_n^0$ is consistent, then $\delta_n^0$ is
asymptotically admissible and minimax.

We shall now consider the asymptotic theory of the
simultaneous classification of a large number of individuals.
Let $X_1, \dots, X_n$ be independently distributed with densities
$p_{\theta_i}(x)$, where the $\theta_i$ are independent random
variables taking on the values 1, 0 with probabilities p, q
respectively. It is desired to classify each $X_i$ according to
its $\theta_i$ in such a way as to minimize the expected number
of misclassifications. For simplicity assume that
$p_1(X)/p_0(X)$ has a continuous distribution both when $\theta$
is 1 and 0. The minimax procedure classifies $X_i$ into
$\mathcal{C}_1$ if and only if $\dfrac{p_1(X_i)}{p_0(X_i)} > k$,
where k is determined by the condition

$$P_0\!\left(\frac{p_1(X)}{p_0(X)} > k\right) = P_1\!\left(\frac{p_1(X)}{p_0(X)} < k\right) = \alpha.$$

Theorem.

Let Y be the number of X's that are classified $\mathcal{C}_1$
by the minimax procedure. Then as $n \to \infty$ the following
sequence of procedures is consistent and hence asymptotically
admissible and minimax.

Classify $X_i$ into $\mathcal{C}_1$ if and only if

$$\frac{p_1(X_i)}{p_0(X_i)} > \frac{(1-\alpha) - Y/n}{Y/n - \alpha}.$$


Proof. We note first that
$\dfrac{(1-\alpha) - Y/n}{Y/n - \alpha}$ is a consistent estimate
of q/p. For Y/n is the frequency of success in n independent
trials with constant probability

$$q\,P_0\!\left(\frac{p_1(X)}{p_0(X)} > k\right) + p\,P_1\!\left(\frac{p_1(X)}{p_0(X)} > k\right) = q\,\alpha + p\,(1-\alpha) = \alpha + p\,(1-2\alpha)$$

of success. Therefore $\dfrac{Y/n - \alpha}{1 - 2\alpha}$ tends
in probability to p. Now

$$\frac{(1-\alpha) - Y/n}{Y/n - \alpha} = \frac{1 - 2\alpha}{Y/n - \alpha} - 1$$

and hence tends in probability to

$$\frac{1}{p} - 1 = \frac{q}{p}.$$

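Now the expected number of misclassifications when p is true
and we use the proposed procedure is

$$n \left[\, q\,P_0\!\left(\frac{p_1(X_1)}{p_0(X_1)} > \frac{(1-\alpha) - Y/n}{Y/n - \alpha}\right) + p\,P_1\!\left(\frac{p_1(X_1)}{p_0(X_1)} < \frac{(1-\alpha) - Y/n}{Y/n - \alpha}\right) \right],$$

where $P_0$, $P_1$ indicate the distribution of $X_1$ while it is
assumed that p is the probability of $\theta$ being 1.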

If p were known we could use the above procedure with the
quantity $\dfrac{(1-\alpha) - Y/n}{Y/n - \alpha}$ replaced by
q/p. Hence in order to prove our result we need only show that
as $n \to \infty$

$$P_i\!\left(\frac{p_1(X_1)}{p_0(X_1)} > \frac{(1-\alpha) - Y/n}{Y/n - \alpha}\right) \to P_i\!\left(\frac{p_1(X_1)}{p_0(X_1)} > \frac{q}{p}\right) \quad \text{for } i = 1, 0. \tag{2}$$


Now let $Y'$ be the number of X's among $X_2, \dots, X_n$ that
satisfy $\dfrac{p_1(X)}{p_0(X)} > k$. Then $Y - Y' \le 1$ and it
is clear that we can replace the left hand side of (2) with

$$P_i\!\left(\frac{p_1(X_1)}{p_0(X_1)} > \frac{(1-\alpha) - Y'/n}{Y'/n - \alpha}\right).$$
But $X_1$ and $Y'$ are independent, and the result follows from
the following fact:

If X, $U_n$ are independent, $U_n \to a$ in probability, and a
is a continuity point of the distribution of X, then

$$P(X > U_n) \to P(X > a).$$

It should be pointed out that the theorem would presumably
remain valid if we replaced our estimate of q/p by any other
consistent estimate. The next stage would be to consider the
speed with which the limiting risk is approached, as one uses
different estimates of q/p.
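
The procedure of the theorem can be sketched in a few lines of
Python. As before, the densities are not specified in the text;
the sketch assumes N(0, 1) for category 0 and N(1, 1) for
category 1, for which the minimax constant is k = 1 by symmetry
and the likelihood ratio is exp(x - 1/2). The treatment of
estimates of p falling outside (0, 1), the simulated data and the
function names are likewise assumptions of the example.

```python
import math
import random


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


ALPHA = 1.0 - phi(0.5)   # P_0(x > 1/2) = P_1(x < 1/2) under the assumed densities


def plug_in_classify(xs):
    """Label X_i as 1 iff p_1(X_i)/p_0(X_i) > ((1 - alpha) - Y/n) / (Y/n - alpha)."""
    freq = sum(x > 0.5 for x in xs) / len(xs)    # Y/n under the minimax rule
    if freq <= ALPHA:                            # estimate of p not positive (assumed handling)
        return [0] * len(xs)
    if freq >= 1.0 - ALPHA:                      # estimate of p not below 1 (assumed handling)
        return [1] * len(xs)
    ratio_cut = ((1.0 - ALPHA) - freq) / (freq - ALPHA)   # consistent estimate of q/p
    x_cut = 0.5 + math.log(ratio_cut)            # exp(x - 1/2) > ratio_cut
    return [1 if x > x_cut else 0 for x in xs]


def bayes_error(p):
    """Per-item risk of the Bayes rule when p is known (same assumed densities)."""
    q = 1.0 - p
    c = 0.5 + math.log(q / p)
    return q * (1.0 - phi(c)) + p * phi(c - 1.0)


rng = random.Random(3)
p_true, n = 0.2, 5000
thetas = [1 if rng.random() < p_true else 0 for _ in range(n)]
xs = [rng.gauss(float(t), 1.0) for t in thetas]

labels = plug_in_classify(xs)
err = sum(l != t for l, t in zip(labels, thetas)) / n
```

For large n the observed error rate `err` should lie close to
`bayes_error(p_true)`, which is the sense in which the procedure
is consistent.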

We have stated the theorem for the case of only two categories.
The extension to the case of s categories is easy and we shall
only sketch it briefly. We now assume that each $\theta_i$ can
take on s values, say 1, 2, ..., s, and that each $X_i$ is to be
classified into one of the classes
$\mathcal{C}_1, \dots, \mathcal{C}_s$ according to the value of
$\theta_i$. If $\pi_j$ is the a priori probability of any one
$\theta$ taking on the value j, the associated Bayes solution
classifies an X into $\mathcal{C}_i$ if

$$\pi_i\, p_i(x) = \max_j \{\pi_j\, p_j(x)\}. \tag{3}$$

Let $R_i$ be the set of points x for which (3) holds, and let

$$\alpha_{ij} = P_j(X \in R_i).$$

If $Y_i$ is the number of X's classified as $\mathcal{C}_i$
under the Bayes procedure associated with some particular set of
a priori probabilities $(\pi_1^0, \dots, \pi_s^0)$, then $Y_i$ is
the number of outcomes i in n multinomial trials in which the
outcome i has constant probability

$$\sum_{j=1}^{s} \pi_j\, \alpha_{ij}.$$

Assume that $\pi_i^0 > 0$ for i = 1, ..., s and let
$(\hat\pi_1, \dots, \hat\pi_s)$ be the solutions of the system of
equations

$$\frac{Y_i}{n} = \sum_{j=1}^{s} \hat\pi_j\, \alpha_{ij}, \qquad i = 1, \dots, s.$$

Then the procedure that classifies $X_r$ into $\mathcal{C}_i$ if

$$\hat\pi_i\, p_i(X_r) = \max_j \{\hat\pi_j\, p_j(X_r)\}$$

is consistent.
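
The estimation step for s categories amounts to solving a small
linear system. The Python sketch below solves the frequencies
$Y_i/n = \sum_j \hat\pi_j \alpha_{ij}$ for the $\hat\pi_j$ by
Gaussian elimination and then classifies a point by the largest
$\hat\pi_j\, p_j(x)$. The matrix of $\alpha_{ij}$, the unit-variance
normal densities with means 0, 2, 4 and all names are
hypothetical; the matrix is chosen for readability and is not
derived from those densities. In finite samples the solution may
contain values outside (0, 1), which the sketch does not handle.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def normal_pdf(x, mean):
    """Density at x of a normal distribution with the given mean and variance 1."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def classify(x, weights, means):
    """Index i (0-based) maximizing weights[i] * p_i(x), with p_i = N(means[i], 1)."""
    scores = [w * normal_pdf(x, mu) for w, mu in zip(weights, means)]
    return scores.index(max(scores))


# Hypothetical alpha[i][j] = P_j(X in R_i); each column sums to 1.
alpha = [[0.80, 0.15, 0.05],
         [0.15, 0.70, 0.15],
         [0.05, 0.15, 0.80]]
true_pi = [0.5, 0.3, 0.2]
# Limiting values of Y_i/n implied by true_pi; real data would supply observed frequencies.
freq = [sum(alpha[i][j] * true_pi[j] for j in range(3)) for i in range(3)]

pi_hat = solve_linear(alpha, freq)
label = classify(3.9, pi_hat, means=[0.0, 2.0, 4.0])
```

The system has a unique solution only when the matrix of
$\alpha_{ij}$ is non-singular, which the sketch checks through the
pivot size.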

5.  Stratification.

It frequently happens that there is more information available
than was assumed in the last section. Suppose namely that the
population is stratified (for example by sex, age, previous
training, etc.). We still have to classify each individual into
one of a number of classes; however, the proportions of
individuals in the various categories presumably differ among
the various strata.

Let us consider the simplest case of two categories and two
strata. The individuals of the two strata will be denoted by Y
and Z respectively. Each $Y_i$ has a probability density
$p_{\eta_i}(y)$, $\eta_i$ = 0 or 1; each $Z_i$ has a density
$p_{\zeta_i}(z)$, $\zeta_i$ = 0 or 1. Let us put

$$P(\eta_i = 1) = p, \qquad P(\zeta_i = 1) = p',$$

and let $\pi$ be the proportion of Y's in the total population.
It is clear that if $p \ne p'$ the procedure discussed in the
last section, which takes no account of the stratification, will
lose its asymptotic properties. For let us assume for a moment
that p, p', $\pi$ are known. Then the (unique) optimum procedure
will differ in its treatment of the Y's and Z's.

On the other hand the procedure that minimizes the risk under
the assumption that all individuals are drawn at random from a
population with a proportion $\pi p + (1-\pi) p'$ of individuals
belonging to class $\mathcal{C}_1$ will classify all of the
individuals according to the same rule, and hence has a higher
risk. Since the asymptotic risk, when the various parameters are
estimated, has been shown to approach the Bayes risk when they
are known, the result follows.

It is also clear now that in the present case the
following procedure will be consistent. We estimate $p$ and $p'$
separately, say by $\hat p$ and $\hat p'$, and classify a Y into the
category with density $p_1$ if

$$\frac{p_1(Y)}{p_0(Y)} > \frac{1-\hat p}{\hat p},$$

and a Z into that category if

$$\frac{p_1(Z)}{p_0(Z)} > \frac{1-\hat p'}{\hat p'}.$$

As a matter of fact this procedure retains consistency even
when $p = p'$; however, this is an indication of the weakness of
the definition rather than of the quality of the procedure.
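
A minimal sketch of such a procedure follows. It rests on assumptions that
are not part of the text: the densities are taken to be $p_0 = N(0,1)$ and
$p_1 = N(\mu,1)$, and each proportion is estimated by a simple moment
estimate (the stratum mean divided by $\mu$), which need not be the estimate
used in the preceding sections. All function names are hypothetical.

```python
import math
import random

def estimate_proportion(sample, mu=1.0):
    """Moment estimate of the proportion of 1's in one stratum.

    Under the assumed densities E(X) = p * mu, so mean(X) / mu estimates p.
    """
    p_hat = sum(sample) / (len(sample) * mu)
    return min(max(p_hat, 0.01), 0.99)  # keep the threshold finite

def classify(x, p_hat, mu=1.0):
    """Return 1 if the likelihood ratio exceeds (1 - p_hat) / p_hat, else 0."""
    likelihood_ratio = math.exp(mu * x - mu * mu / 2.0)
    return 1 if likelihood_ratio > (1.0 - p_hat) / p_hat else 0

def draw_stratum(n, p, rng, mu=1.0):
    """Simulate n individuals of one stratum; returns (labels, observations)."""
    labels = [1 if rng.random() < p else 0 for _ in range(n)]
    values = [rng.gauss(mu * t, 1.0) for t in labels]
    return labels, values

def classify_strata(y_values, z_values, mu=1.0):
    """Estimate the two proportions separately, then classify each stratum."""
    p_hat = estimate_proportion(y_values, mu)
    p_prime_hat = estimate_proportion(z_values, mu)
    y_classes = [classify(y, p_hat, mu) for y in y_values]
    z_classes = [classify(z, p_prime_hat, mu) for z in z_values]
    return p_hat, p_prime_hat, y_classes, z_classes
```

The same observation can be classified differently in the two strata: with
$\mu = 1$, the value 1.0 is assigned to category 0 when the estimated
proportion is 0.2 and to category 1 when it is 0.7.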

Taking  account  of  stratification  not  only  improves 
the  risk  function  but  it  also  avoids  the  necessity  of  making 
assumptions  about  how  the  sample  is  divided  between  the  strata. 
In  the  first  treatment  we  assumed  random  sampling  from  the  total 
population. However, if the stratification is such that the
strata differ markedly in the proportions of the various
categories, this assumption is likely to be fallacious.

While there are thus considerable statistical advantages
in not using the minimax procedure, doing so also entails,
at least in certain problems, serious disadvantages
of an ethical nature. While the issue is brought out
particularly  clearly  in  connection  with  stratification,  it  is 
actually  present  in  the  whole  discussion.  If  each  individual 
can  be  either  0  or  1  and  if  some  significance  attaches  to 
the  classification,  one  feels  strongly  that  each  person  should 
be  classified  on  the  basis  of  his  own  performance  without  regard 
to  that  of  the  other  individuals  being  classified. 


At first one may feel that the fault lies with the
loss function. We have stated it as our task to minimize the
total average number of misclassifications. However, exactly the
same phenomenon occurs if we are interested in classifying only
individual i. If we want to minimize the probability of
misclassification we will estimate the proportion of 0's in
the population, and proceed as we did before.

The moral conflict arises with the assumption of
random sampling from the population. The individual does not
consider himself drawn at random from a population. For him
$\theta$ is not a random variable but a parameter. Thus, if we
want to meet this objection we have to forgo the advantage
of the assumption of random sampling and treat the $\theta$'s as
parameters.

It might seem that even then the difficulty
remains because of the possibilities brought out by Robbins.
This is, however, not so. The phenomenon described by Robbins
occurs if we express the risk in terms of the frequency of
0's in the group that is being classified. But this is
inappropriate if we are concerned with the classification
of the single individual. Then $\theta_i$ is 0 or 1, and the
risk must be expressed in terms of these two possibilities
and not an extraneous frame of reference.

6.  On  a  general  class  of  problems. 

The  problems  discussed  here  have  certain  features 
in  common  with  a  large  class  of  statistical  problems.  As  we 
shall  indicate,  it  seems  likely  that  the  results  we  obtained 
in  the  special  cases  apply  more  generally. 

In the examples we considered, the distribution of
the observable random variables was, as is usually the case,
only partially known. However (and here they differ from
the classical problems of statistics), even if the distribution
were known this would still not imply knowledge of the
correct decision, since this depends on the values of some
unobservable random variables. The same situation occurs, for
example, in all prediction problems.

Suppose in general that we are concerned with a
situation in which a decision is to be made on the basis of
observable random variables $X_1, \ldots, X_n$ whose joint distribution,
for all $n$, depends on a certain parameter $\theta$. It is assumed
that as $n \to \infty$ one can estimate $\theta$ consistently. The correct
decision depends not directly on $\theta$ but on certain unobservable
random variables the distribution of which also involves $\theta$.

In  such  cases  it  seems  to  be  true  rather  generally 
that  the  minimax  procedure  is  asymptotically  inadmissible. 

For suppose that $\theta$ were known and let $\theta_0$ be that
value of $\theta$ to which corresponds the biggest (Bayes) risk. In
some cases the minimax solution is, for all $n$, the Bayes
solution corresponding to this worst value of the parameter.
(This is the case in the example considered in section 2 and
in the prediction of the outcome of a single binomial trial
(see [10]).) In other cases the least favorable distribution
(if one exists) is not concentrated at this one value since it
takes into account both the difficulty of determining the correct
value of $\theta$ and the difficulty of determining the correct value
of the non-observable variables when $\theta$ is known. However, as
$n \to \infty$, the difficulty of determining the correct value of $\theta$
gradually disappears, and hence the least favorable distribution
does tend to concentrate around the "worst" value of $\theta$. The
asymptotic inadmissibility of the Bayes solution corresponding
to the least favorable distribution (i.e., of the minimax
procedure) now follows as before from the fact that we can
determine the true value of $\theta$ quite accurately and hence
do not have to take the pessimistic attitude of the minimax
solution.
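
The argument can be illustrated on the prediction of a single binomial
trial. The sketch below makes assumptions that are not in the text: the loss
is taken to be squared error, $X$ is binomial $(n, \theta)$, and the future
trial $Y$ is independent of $X$. Under this loss the constant prediction
$1/2$ has risk $1/4$ for every $\theta$, which equals the Bayes risk at the
worst value $\theta = 1/2$, so it is minimax for every $n$. The prediction
$X/n$, which estimates $\theta$, is used for comparison. Function names are
hypothetical, and `math.comb` needs Python 3.8 or later.

```python
import math

def minimax_risk(theta):
    """Risk of the constant prediction 1/2: (Y - 1/2)^2 = 1/4 whatever Y is."""
    return 0.25

def plugin_risk(theta, n):
    """Exact risk E(Y - X/n)^2, summing over the binomial distribution of X."""
    total = 0.0
    for k in range(n + 1):
        prob_k = math.comb(n, k) * theta**k * (1.0 - theta)**(n - k)
        d = k / n
        # Average the squared error over the two outcomes of the future trial.
        loss = theta * (1.0 - d)**2 + (1.0 - theta) * d**2
        total += prob_k * loss
    return total

def plugin_risk_closed_form(theta, n):
    """Var(Y) + Var(X/n) = theta (1 - theta) (1 + 1/n)."""
    return theta * (1.0 - theta) * (1.0 + 1.0 / n)
```

For $\theta = 0.3$ and $n = 50$ the risk of $X/n$ is below $1/4$. At
$\theta = 1/2$ it exceeds $1/4$ by $1/(4n)$, an excess that vanishes as $n$
grows.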

At the same time we see that the Bayes procedure
corresponding to the estimated value $\hat\theta$ of $\theta$ is consistent
and hence asymptotically admissible and minimax. For the
contribution to the risk resulting from having the wrong value
of $\theta$ tends to zero, and hence the total risk tends to the
risk one would be left with even if $\theta$ were known.

In some problems of the kind being considered one
can avoid the difficulties of the minimax procedure by adopting
the notion, due to L. J. Savage, of minimizing the maximum regret.
It is easy to see, for example, in prediction problems with
squared error as loss function that the prediction of a random
variable Y that minimizes the maximum regret is the same as the
minimax estimate of E(Y).
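
The reason can be sketched as follows, under an assumption not stated
explicitly in the text: for each fixed $\theta$ the prediction $d = d(X)$ and
the variable $Y$ are independent, so that the cross term vanishes. The
notation $E_\theta$ and $\mathrm{regret}(\theta, d)$ is introduced here for
illustration only.

```latex
\begin{align*}
E_\theta (Y - d)^2
  &= E_\theta (Y - E_\theta Y)^2 + E_\theta (d - E_\theta Y)^2 , \\
\inf_{d'} E_\theta (Y - d')^2
  &= E_\theta (Y - E_\theta Y)^2
  \quad \text{(attained by } d' \equiv E_\theta Y \text{ when } \theta \text{ is known)} , \\
\mathrm{regret}(\theta, d)
  &= E_\theta (Y - d)^2 - \inf_{d'} E_\theta (Y - d')^2
   = E_\theta (d - E_\theta Y)^2 .
\end{align*}
```

The regret is thus the squared-error risk of $d$ regarded as an estimate of
$E_\theta Y$, so that minimizing its maximum amounts to finding the minimax
estimate of $E(Y)$.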

It  should  finally  be  pointed  out  that  the  asymptotic 
inadmissibility  of  minimax  procedures  may  also  occur  in  classical 
problems  where  the  difficulties  discussed  in  the  present  section 
do  not  arise.  An  example  in  question  is  the  estimation  of  a 
binomial probability [10].

The classification of individuals is considered, and for some simple problems
minimax procedures are obtained. Other procedures are given which are minimax
and admissible in the limit as the number of individuals becomes larger.